NSF PAR Search | NSF Public Access Repository

Guiding long-horizon task and motion planning with vision language models.

Yang, Zhutian; Garrett, Caelan; Kumar, Nishanth; Fox, Dieter; Lozano-Perez, Tomas; Kaelbling, Leslie (June 2025, Proceedings IEEE International Conference on Robotics and Automation)

Vision-Language Models (VLMs) can generate plausible high-level plans when prompted with a goal, the context, an image of the scene, and any planning constraints. However, there is no guarantee that the predicted actions are geometrically and kinematically feasible for a particular robot embodiment. As a result, many prerequisite steps, such as opening drawers to access objects, are often omitted from their plans. Robot task and motion planners can generate motion trajectories that respect the geometric feasibility of actions and insert physically necessary actions, but they do not scale to everyday problems that require common-sense knowledge and involve large state spaces composed of many variables. We propose VLM-TAMP, a hierarchical planning algorithm that leverages a VLM to generate both semantically meaningful and horizon-reducing intermediate subgoals that guide a task and motion planner. When a subgoal or action cannot be refined, the VLM is queried again for replanning. We evaluate VLM-TAMP on kitchen tasks where a robot must accomplish cooking goals that require performing 30-50 actions in sequence and interacting with up to 21 objects. VLM-TAMP substantially outperforms baselines that rigidly and independently execute VLM-generated action sequences, both in terms of success rates (50 to 100% versus 0%) and average task completion percentage (72 to 100% versus 15 to 45%).
Free, publicly-accessible full text available June 2, 2026
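
The replanning loop described in the VLM-TAMP abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `plan_with_subgoals`, `propose_subgoals`, `refine`, and the toy kitchen stubs are hypothetical names introduced here, the VLM is replaced by a hand-written function, and the task and motion planner is replaced by a lookup that fails until a drawer has been opened.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces (assumptions, not the paper's API):
#   propose_subgoals(goal, completed, failed) -> ordered list of remaining subgoals
#   refine(subgoal) -> list of low-level actions, or None if no refinement is found
ProposeFn = Callable[[str, List[str], Optional[str]], List[str]]
RefineFn = Callable[[str], Optional[List[str]]]


def plan_with_subgoals(goal: str, propose_subgoals: ProposeFn, refine: RefineFn,
                       max_queries: int = 3) -> Optional[List[str]]:
    """Refine proposed subgoals in order; query the proposer again after a failure."""
    actions: List[str] = []
    completed: List[str] = []
    failed: Optional[str] = None
    for _ in range(max_queries):
        subgoals = propose_subgoals(goal, completed, failed)
        failed = None
        for subgoal in subgoals:
            steps = refine(subgoal)
            if steps is None:
                failed = subgoal  # reported to the proposer in the next query
                break
            actions.extend(steps)
            completed.append(subgoal)
        if failed is None:
            return actions
    return None  # query budget exhausted


# Toy stand-ins for the VLM and the planner (purely illustrative).
opened = set()


def toy_propose(goal: str, completed: List[str], failed: Optional[str]) -> List[str]:
    if failed == "pot on stove":
        # After a failure, the proposer adds the omitted prerequisite step.
        return ["drawer open", "pot on stove"]
    return [s for s in ["pot on stove"] if s not in completed]


def toy_refine(subgoal: str) -> Optional[List[str]]:
    if subgoal == "drawer open":
        opened.add("drawer")
        return ["open(drawer)"]
    if subgoal == "pot on stove" and "drawer" in opened:
        return ["pick(pot)", "place(pot, stove)"]
    return None


plan = plan_with_subgoals("cook soup", toy_propose, toy_refine)
```

In this toy run the first proposal omits the drawer, refinement fails, and the second query supplies the missing prerequisite, mirroring the failure mode the abstract attributes to VLM-only plans.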
Sequence-Based Plan Feasibility Prediction for Efficient Task and Motion Planning

Yang, Zhutian; Garrett, Caelan; Lozano-Perez, Tomas; Kaelbling, Leslie; Fox, Dieter (January 2023, Robotics: Science and Systems)

We present a learning-enabled Task and Motion Planning (TAMP) algorithm for solving mobile manipulation problems in environments with many articulated and movable obstacles. Our idea is to bias the search procedure of a traditional TAMP planner with a learned plan feasibility predictor. The core of our algorithm is PIGINet, a novel Transformer-based learning method that takes in a task plan, the goal, and the initial state, and predicts the probability of finding motion trajectories associated with the task plan. We integrate PIGINet within a TAMP planner that generates a diverse set of high-level task plans, sorts them by their predicted likelihood of feasibility, and refines them in that order. We evaluate the runtime of our TAMP algorithm on seven families of kitchen rearrangement problems, comparing its performance to that of non-learning baselines. Our experiments show that PIGINet substantially improves planning efficiency, cutting down runtime by 80% on problems with small state spaces and 10%-50% on larger ones, after being trained on only 150-600 problems. Finally, it also achieves zero-shot generalization to problems with unseen object categories thanks to its visual encoding of objects.
Full Text Available
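
The sort-then-refine step described in the abstract above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `refine_in_predicted_order` is a hypothetical name, the learned predictor is replaced by a fixed score table, and motion planning is replaced by a stub that succeeds for exactly one plan.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Plan = Tuple[str, ...]


def refine_in_predicted_order(
    plans: Sequence[Plan],
    predict_feasibility: Callable[[Plan], float],  # stand-in for a learned predictor
    refine: Callable[[Plan], Optional[List[str]]],  # stand-in for motion planning
) -> Optional[Tuple[Plan, List[str], int]]:
    """Try candidate task plans from most to least likely feasible.

    Returns (plan, trajectory, number of refinement attempts), or None if
    no candidate could be refined.
    """
    ranked = sorted(plans, key=predict_feasibility, reverse=True)
    for attempts, plan in enumerate(ranked, start=1):
        trajectory = refine(plan)
        if trajectory is not None:
            return plan, trajectory, attempts
    return None


# Toy data (purely illustrative): only the second plan is actually feasible.
plans = [
    ("pick(a)",),
    ("open(door)", "pick(a)"),
    ("push(b)", "pick(a)"),
]
scores = {plans[0]: 0.1, plans[1]: 0.9, plans[2]: 0.4}
feasible = {plans[1]}


def toy_refine(plan: Plan) -> Optional[List[str]]:
    return [f"traj:{a}" for a in plan] if plan in feasible else None


with_predictor = refine_in_predicted_order(plans, lambda p: scores[p], toy_refine)
# A constant score keeps the original order, because Python's sort is stable.
without_predictor = refine_in_predicted_order(plans, lambda p: 0.0, toy_refine)
```

With the score table, the feasible plan is refined on the first attempt; with a constant score, the original order is kept and a failed refinement is spent first. The runtime savings reported in the abstract come from avoiding such failed refinement attempts.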
SORNet: Spatial object-centric representations for sequential manipulation

Yuan, Wentao; Paxton, Chris; Desingh, Karthik; Fox, Dieter (January 2022, Conference on Robot Learning)

Full Text Available
Learning RGB-D Feature Embeddings for Unseen Object Instance Segmentation

Xiang, Yu; Xie, Christopher; Mousavian, Arsalan; Fox, Dieter (January 2021, Conference on Robot Learning (CoRL))
Full Text Available
RICE: Refining Instance Masks in Cluttered Environments with Graph Neural Networks

Xie, Christopher; Mousavian, Arsalan; Xiang, Yu; Fox, Dieter (January 2021, Conference on Robot Learning)
Full Text Available
Unseen Object Instance Segmentation for Robotic Environments

Xie, Christopher; Xiang, Yu; Mousavian, Arsalan; Fox, Dieter (January 2021, IEEE Transactions on Robotics)
Full Text Available
A Persistent Spatial Semantic Representation for High-level Natural Language Instruction Execution

Blukis, Valts; Paxton, Chris; Fox, Dieter; Garg, Animesh; Artzi, Yoav (January 2021, Proceedings of the Conference on Robot Learning (CoRL))

Full Text Available